1 PUMS Multiple Regression: Marital Status

Many recent studies have found that marital status increasingly correlates with socioeconomic status (UPenn Wharton Budget Model, Pew Research, American Enterprise Institute). In particular, these studies have shown that those who get married are increasingly high-income, college-educated individuals who have the financial stability to support a family.

I will use individual-level Census PUMS data to see if these trends also hold in the Bay Area. In this section, I study the relationship between marital status (which I set as my dependent variable) and income and educational attainment (which I set as my independent variables) in the Bay Area. This analysis will be at the PUMA level. I will focus on individuals aged 18 or older, since individuals under 18 rarely marry.

1.1 Data setup

ca_pums = get_pums(variables = c("PUMA", "AGEP", "MAR", "SCHL", "PINCP"),
                   state = "CA",
                   year = 2019,
                   survey = "acs1",
                   recode = T)
saveRDS(ca_pums, file = "ca_pums.rds")
# Get PUMS and PUMAS data

pums_vars_2019 = pums_variables %>%
                 filter(year == 2019, survey == "acs1")
pums_vars_2019_inds = pums_vars_2019 %>%
                      distinct(var_code, var_label, data_type, level) %>%
                      filter(level == "person")
ca_pums = readRDS("ca_pums.rds")
ca_pumas = pumas("CA", cb = T, progress_bar = F)

# Isolate Bay Area PUMS

bay_county_names =
  c(
    "Alameda",
    "Contra Costa",
    "Marin",
    "Napa",
    "San Francisco",
    "San Mateo",
    "Santa Clara",
    "Solano",
    "Sonoma"
  )

ca_counties = counties("CA", cb = T, progress_bar = F)
bay_counties = ca_counties %>%
               filter(NAME %in% bay_county_names)

bay_pumas = ca_pumas %>%
            st_centroid() %>%
            .[bay_counties, ] %>%
            mutate(PUMACE10 = as.numeric(PUMACE10)) %>%
            st_set_geometry(NULL) %>%
            left_join(ca_pumas %>% select(GEOID10)) %>%
            st_as_sf()
bay_pums = ca_pums %>%
           mutate(PUMA = as.numeric(PUMA)) %>%
           filter(PUMA %in% bay_pumas$PUMACE10)

bay_pums
# Clean dataset
# Filter out all children under age 3 (SCHL == bb)
# Convert "character"-encoded variables MAR and SCHL to doubles
# Filter dataset so all individuals are over 18 years old
# Create new variable "married" representing number of ever-married individuals in each PUMA
# Create new variable "college" representing number of college-educated individuals in each PUMA
# Collapse to PUMAs
# Calculate percentage of married and college-educated people in each PUMA
# Additionally calculate weighted mean income in each PUMA

bay_marriage = bay_pums %>%
               filter(SCHL != "bb") %>%
               mutate(MAR = as.numeric(MAR),
                      SCHL = as.numeric(SCHL)) %>%
               filter(AGEP >= 18) %>%
               mutate(married = ifelse(
                        (MAR < 5),
                        PWGTP,
                        0),
                      college = ifelse(
                        (SCHL >= 21),
                        PWGTP,
                        0
                      )
               ) %>%
               group_by(PUMA) %>%
               summarize(mean_personal_income = weighted.mean(PINCP, PWGTP, na.rm = T),
                         perc_married = sum(married, na.rm = T)/
                                        sum(PWGTP, na.rm = T)*100,
                         perc_college = sum(college, na.rm = T)/
                                        sum(PWGTP, na.rm = T)*100)

bay_marriage

1.2 Single regression: income vs. marriage rate

# Regression

model = lm(perc_married ~ mean_personal_income, bay_marriage)
summary(model)
## 
## Call:
## lm(formula = perc_married ~ mean_personal_income, data = bay_marriage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.524  -2.357   1.993   5.483  14.192 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.285e+01  3.725e+00  16.873   <2e-16 ***
## mean_personal_income 5.933e-05  4.962e-05   1.196    0.237    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.471 on 53 degrees of freedom
## Multiple R-squared:  0.02627,    Adjusted R-squared:  0.007898 
## F-statistic:  1.43 on 1 and 53 DF,  p-value: 0.2371
# Scatterplot

ggplot(data = bay_marriage,
       aes(x = mean_personal_income,
           y = perc_married)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

From the regression, it doesn’t seem that personal income itself has an impact on the likelihood of marriage in any given census tract. (An increase in $10,000 in income corresponds to a 0.5 percentage point increase in marriage rate, a small increase indeed.) A scatterplot seems to show a positive correlation between personal income and marriage rates, but the margin of error for this line is high — it’s possible to draw a straight line through the area covered by the margin of error.

1.3 Single regression: educational attainment vs. marriage rate

# Regression

model = lm(perc_married ~ perc_college, bay_marriage)
summary(model)
## 
## Call:
## lm(formula = perc_married ~ perc_college, data = bay_marriage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4493  -3.0704   0.9118   5.3427  16.4459 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  67.024808   3.568125  18.784   <2e-16 ***
## perc_college  0.001355   0.073618   0.018    0.985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.584 on 53 degrees of freedom
## Multiple R-squared:  6.394e-06,  Adjusted R-squared:  -0.01886 
## F-statistic: 0.0003389 on 1 and 53 DF,  p-value: 0.9854
# Scatterplot

ggplot(data = bay_marriage,
       aes(x = perc_college,
           y = perc_married)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

It also appears from the regression that having a college education does not have an impact on marriage rates. The scatterplot shows an essentially flat line with a large margin of error, suggesting that it is unlikely the two are related.

1.4 Multiple regression

# Correlation plot

correlationplot = bay_marriage %>%
                  select(mean_personal_income,
                         perc_college,
                         perc_married) %>%
                  cor()
corrplot(correlationplot,
         method = "number",
         type = "upper")

Before interpreting regression results, it is important to note that college education and income are highly correlated, as shown in the correlation plot above. Thus there is bound to be collinearity in this regression, which undermines the model’s predictive power.

# Regression

model = lm(perc_married ~ mean_personal_income + perc_college, bay_marriage)
summary(model)
## 
## Call:
## lm(formula = perc_married ~ mean_personal_income + perc_college, 
##     data = bay_marriage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.4374  -2.4510   0.6059   4.4006  14.0392 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          63.6079529  3.5538365  17.898   <2e-16 ***
## mean_personal_income  0.0003123  0.0001090   2.865   0.0060 ** 
## perc_college         -0.4108149  0.1595885  -2.574   0.0129 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.054 on 52 degrees of freedom
## Multiple R-squared:  0.1363, Adjusted R-squared:  0.1031 
## F-statistic: 4.104 on 2 and 52 DF,  p-value: 0.02213

When regressing both together, it seems that both factors have a significant impact on marriage rates. Increases in mean personal income correspond to increases in marriage rates, as predicted in my hypothesis. Contrary to what the above cited studies find, increases in college attendance correspond to decreases in marriage rates. However, this effect seems to be less pronounced than the income effect.

A few caveats that might explain this discrepancy is ???

2 CBG/Tract/ZCTA-Level Regression: Sleep Deprivation

According to the CDC and a study published in Sleep Medicine, poorer people tend to sleep less than richer people. Additionally, a recent meta-analysis in Sleep Health finds that white children and adolescents tend to sleep more than their non-white peers. Intuitively, it makes sense that poorer people sleep less, since they may work longer hours and undergo more stresses related to financial insecurity. Since people of color, especially Black and Hispanic/Latinx people, are disproportionately likely to be poor, it also makes sense that race correlates with sleep deprivation.

In this section, I investigate whether this relationship persists in the Bay Area. I conduct a Census tract-level analysis using data from the CDC’s 500 Cities project and the 2018 American Community Survey 5-year estimates.

2.1 Data setup: 500 Cities data

all_health = read_csv("https://chronicdata.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD")
ca_health = filter(all_health, StateAbbr == "CA")
ca_sleep = filter(ca_health, MeasureId == "SLEEP")

# Filter to Bay Area census tracts
ca_tracts = tracts("CA", cb = T, progress_bar = F)

bay_sleep = ca_sleep %>%
            filter(!is.na(TractFIPS)) %>%
            filter(!is.na(Data_Value)) %>%
            mutate(perc_sleepdep = Data_Value) %>%
            left_join(ca_tracts %>% select(GEOID),
                      by = c("TractFIPS" = "GEOID")) %>%
            st_as_sf() %>%
            st_centroid() %>%
            .[bay_counties, ] %>%
            st_set_geometry(NULL) %>%
            left_join(ca_tracts %>% select(GEOID),
                      by = c("TractFIPS" = "GEOID")) %>%
            st_as_sf()

bay_sleep

2.2 Mapping sleep deprivation

sleep_pal = colorNumeric(palette = "Purples",
                         domain = bay_sleep$perc_sleepdep)

leaflet() %>%
  addTiles() %>%
  addPolygons(
    data = bay_sleep,
    fillColor = ~sleep_pal(perc_sleepdep),
    color = "white",
    opacity = 0.5,
    fillOpacity = 0.75,
    weight = 1,
    label = ~paste0(round(perc_sleepdep), "% in ", TractFIPS),
    highlightOptions = highlightOptions(weight = 2,
                                        opacity = 1)) %>%
    addLegend(data = bay_sleep,
              pal = sleep_pal,
              values = ~perc_sleepdep,
              title = "Percentage of residents<br>who sleep less than 7<br>hours per night")

2.3 Data setup: ACS 2018 5-Year Data

We use ACS 2018 5-year data because tract-level data is only available in the 5-year surveys.

acs_vars_2018_5yr = listCensusMetadata(name = "2018/acs/acs5",
                                       type = "variables")

bay_income_race = getCensus(name = "acs/acs5",
                            vintage = 2018,
                            region = "tract:*",
                            regionin = "state:06+county:001,013,041,055,075,081,085,095,097",
                            vars = c("B19001A_001E",
                                     "B19001_001E",
                                     "B19013_001E")) %>%
                  transmute(tract = paste0(state, county, tract),
                            perc_white = B19001A_001E / B19001_001E * 100,
                            med_income = B19013_001E) %>%
                  filter(!is.na(perc_white),
                         !is.na(med_income))

bay_income_race

2.4 Merging the datasets

bay_sleep_income_race = bay_income_race %>%
                        left_join(bay_sleep %>% select(TractFIPS, perc_sleepdep),
                                  by = c("tract" = "TractFIPS")) %>%
                        filter(!is.na(perc_sleepdep))

bay_sleep_income_race

2.5 Outliers in income variable

# Scatterplot

ggplot(data = bay_sleep_income_race,
       aes(x = med_income,
           y = perc_sleepdep)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

It appears that there are a couple of outliers in the income variable that completely skewer the regression. Let’s examine what the outliers are and where they are located.

# Finding the outliers

outlier_exam = bay_sleep_income_race %>%
               st_as_sf() %>%
               arrange(med_income) %>%
               .[c(1,2),]
outlier_exam
# Mapping the outliers

outlier_exam %>% leaflet() %>%
                 addTiles() %>%
                 addPolygons()

As it turns out, the outliers are both non-residential areas in San Francisco (Golden Gate Park and an industrial area around Islais Creek), and the encoding suggests that income is not available in those two areas. I drop those two values and use this adjusted dataset to perform all further analyses.

2.6 Dropping outliers

bay_sleep_income_race = bay_sleep_income_race %>%
                        filter(med_income > 0)

2.7 Single regression: Income vs. sleep deprivation

# Regression

model = lm(perc_sleepdep ~ med_income, bay_sleep_income_race)
summary(model)
## 
## Call:
## lm(formula = perc_sleepdep ~ med_income, data = bay_sleep_income_race)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.398  -1.873  -0.004   1.725  10.063 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.978e+01  2.231e-01  178.30   <2e-16 ***
## med_income  -6.216e-05  2.026e-06  -30.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.795 on 1092 degrees of freedom
## Multiple R-squared:  0.4629, Adjusted R-squared:  0.4624 
## F-statistic:   941 on 1 and 1092 DF,  p-value: < 2.2e-16
# Scatterplot

ggplot(data = bay_sleep_income_race,
       aes(x = med_income,
           y = perc_sleepdep)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Now that the outliers have been removed, it is clear that there is a negative association between sleep deprivation and income. Poorer people are more likely to be sleep-deprived; richer people are less likely. This relationship is significant.

2.8 Single regression: Race vs. sleep deprivation

# Regression

model = lm(perc_sleepdep ~ perc_white, bay_sleep_income_race)
summary(model)
## 
## Call:
## lm(formula = perc_sleepdep ~ perc_white, data = bay_sleep_income_race)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1006 -2.2562 -0.4398  2.2207  8.6396 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.238804   0.235387  166.70   <2e-16 ***
## perc_white  -0.112826   0.004236  -26.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.969 on 1092 degrees of freedom
## Multiple R-squared:  0.3938, Adjusted R-squared:  0.3933 
## F-statistic: 709.4 on 1 and 1092 DF,  p-value: < 2.2e-16
# Scatterplot

ggplot(data = bay_sleep_income_race,
       aes(x = perc_white,
           y = perc_sleepdep)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

There is a negative association between the percent of white people in a tract and the percent of sleep-deprived individuals. This relationship is also significant.

2.9 Multiple regression

# Correlation plot

correlationplot = bay_sleep_income_race %>%
                  select(perc_sleepdep,
                         med_income,
                         perc_white) %>%
                  cor()
corrplot(correlationplot,
         method = "number",
         type = "upper")

There is a relationship between median income and percentage of white people in each census tract, but the relationship is not strong enough that I am concerned about collinearity for this analysis.

# Regression

model = lm(perc_sleepdep ~ med_income + perc_white, bay_sleep_income_race)
summary(model)
## 
## Call:
## lm(formula = perc_sleepdep ~ med_income + perc_white, data = bay_sleep_income_race)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6363 -1.5603 -0.0685  1.5055  7.7371 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.260e+01  2.159e-01  197.30   <2e-16 ***
## med_income  -4.866e-05  1.737e-06  -28.02   <2e-16 ***
## perc_white  -8.169e-02  3.417e-03  -23.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.265 on 1091 degrees of freedom
## Multiple R-squared:  0.6475, Adjusted R-squared:  0.6469 
## F-statistic:  1002 on 2 and 1091 DF,  p-value: < 2.2e-16

Echoing the single regression results above, both income and race have a significant impact on sleep deprivation rates. An income increase in $10,000 corresponds to a 0.4 percentage point decrease in sleep deprivation rates; a 1 percentage point increase in the percentage of white people corresponds to a 0.08 percentage point decrease in the percentage of sleep deprivation rates.

Overall, the national trends reported in the literature are echoed in the Bay Area.